In the class, we have learned simple join. Sometimes, for some of the rows you are joining, you may not be able to find a matching records in another table. In that case, you still want to keep the records from one table after JOIN. In this lab-problem set, we will learn how to do that.

The following figure shows different types of joins, and how to run the query for each type.

joins

joins

Curtailed FacebookDB

library(DBI)
# create database: this will create a file in our hard drive
db <- dbConnect(RSQLite::SQLite(), "~/Data/my472/facebook-db-mini.sqlite")
# reading the first file
congress <- read.csv("~/Data/my472/congress-facebook-2017.csv",
    stringsAsFactors=F)

set.seed(20181128)
congress <- congress[sample(nrow(congress), 100), ]
# adding first table: user-level data
dbWriteTable(db, "congress", congress)
# testing that it works with a simple query
dbListFields(db, "congress")
dbGetQuery(db, 'SELECT * FROM congress LIMIT 5')
fls <- list.files("~/Data/my472/posts", full.names=TRUE)
set.seed(20181128)

for (f in fls){
  
  message(f)
  # read file into memory
  fb <- read.csv(f, stringsAsFactors=F)
  fb <- fb[sample(nrow(fb), size = nrow(fb)/20), ]
  
  # adding to table in SQL database
  dbWriteTable(db, "posts", fb, append=TRUE)
  
}
# testing that it works
dbListFields(db, "posts")
dbGetQuery(db, 'SELECT * FROM posts LIMIT 5')
# what if we make a mistake and want to remove the table?
# dbRemoveTable(db, "posts")
# and we close the connection for now
dbDisconnect(db)

Questions

library(DBI)
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
db <- dbConnect(RSQLite::SQLite(), "~/Data/my472/facebook-db-mini.sqlite")
  1. How many rows in each table
dbGetQuery(db, 
"SELECT COUNT(*) FROM posts")
##   COUNT(*)
## 1     8530
dbGetQuery(db, 
"SELECT COUNT(*) FROM congress")
##   COUNT(*)
## 1      100
  1. How many rows by joining posts with congress?
dbGetQuery(db, 
"SELECT COUNT(*) FROM posts AS p
JOIN congress AS c
ON c.screen_name = p.screen_name
")
##   COUNT(*)
## 1     1655
  1. How many rows by left-joining posts with congress?
dbGetQuery(db, 
"SELECT COUNT(*) FROM posts AS p
LEFT JOIN congress AS c
ON c.screen_name = p.screen_name
")
##   COUNT(*)
## 1     8530
  1. What happens if you try right join?
msg <- try(
dbGetQuery(db, 
"SELECT COUNT(*) FROM posts AS p
RIGHT JOIN congress AS c
ON c.screen_name = p.screen_name
"))
message(msg)
## Error in result_create(conn@ptr, statement) : 
##   RIGHT and FULL OUTER JOINs are not currently supported
  1. Try left outer join with IS NULL condition
dbGetQuery(db, 
"SELECT COUNT(*) FROM posts AS p
LEFT JOIN congress AS c
ON c.screen_name = p.screen_name
WHERE c.bioid IS NULL
")
##   COUNT(*)
## 1     6875
dbDisconnect(db)

Subquery

In SQL, you can run a nested query where you use a result from the first query as a condition for the second query. Where we find a top 3 most frequent posters in the congress, and then get all messages from them.

#connect to the full db
db <- dbConnect(RSQLite::SQLite(), "~/Data/my472/facebook-db.sqlite")
  1. What are the screen names of the top 3 posters?
dbGetQuery(db, 
"SELECT screen_name, COUNT(*) AS post_count
FROM posts
GROUP BY screen_name
ORDER BY post_count DESC
LIMIT 3
")
##         screen_name post_count
## 1 CongressmanNadler       1520
## 2     HurdOnTheHill       1500
## 3   RepMikeThompson       1494
dbGetQuery(db, 
"SELECT screen_name
FROM posts
GROUP BY screen_name
ORDER BY COUNT(*) DESC
LIMIT 3
")
##         screen_name
## 1 CongressmanNadler
## 2     HurdOnTheHill
## 3   RepMikeThompson
  1. (Without subquery) how to get the posts from these congress members?
library(DT)
dbGetQuery(db, 
"SELECT *
FROM posts
WHERE screen_name IN ('CongressmanNadler', 'HurdOnTheHill', 'RepMikeThompson')
") %>% 
  datatable()
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
  1. Now, instead of hardcoding the screen_names in the conditional statement, use the query results in #1 as the subquery.
dbGetQuery(db, 
"SELECT *
FROM posts
WHERE screen_name IN (SELECT screen_name
    FROM posts
    GROUP BY screen_name
    ORDER BY COUNT(*) DESC
    LIMIT 3)
") %>% 
  datatable()
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html